Albert Liu

Univariate Plots Section

First, let’s run some basic functions to get a picture of the dataset. It consists of 13 variables and 1,599 observations.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Since we are primarily interested in the quality of the red wine, let’s look at its summary statistics first. Quality ranges from 3 to 8, with a median of 6.

Wine Quality

According to the explanation in wineQualityInfo.txt, the score of a wine lies between 0 and 10. Since the scores in our data only range from 3 to 8, it is convenient to categorize the wines into 3 kinds:

  • bad (score: 3 or 4);
  • average (score: 5 or 6);
  • good (score: 7 or 8).
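A minimal sketch of this grouping in R, assuming the scores sit in a numeric vector. The toy `quality` vector and the name `rating` are illustrative and not taken from the original analysis code:

```r
# Toy scores for illustration; in the real analysis this would be df$quality.
quality <- c(3, 4, 5, 6, 7, 8, 5, 6)

# cut() uses right-closed intervals by default, so breaks at 2, 4, 6, 8
# produce (2,4], (4,6], (6,8]: scores 3-4, 5-6 and 7-8 respectively.
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

# Count how many wines fall in each group.
table(rating)
```

Making the factor ordered keeps bad < average < good in plots and tables instead of sorting the labels alphabetically.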

Distributions and Outliers

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 132 rows containing non-finite values (stat_bin).

As the plot warning says, 132 non-finite values were removed. Let’s check where they come from:

length(subset(df, citric.acid == 0)$citric.acid)
## [1] 132

So 132 wines have a citric acid value of zero. Since log10(0) is not finite, it is no surprise that the log10 plot above does not show these values.
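The mechanism can be reproduced on a handful of values. This is only a sketch with a made-up vector, not the original plotting code:

```r
# Toy citric acid values; three of the five are exactly zero.
citric.acid <- c(0, 0, 0.04, 0.56, 0)

# log10(0) evaluates to -Inf, which cannot be placed on a log-scaled axis.
logged <- log10(citric.acid)

# The number of non-finite results equals the number of zeros.
n_dropped <- sum(!is.finite(logged))
n_zero    <- sum(citric.acid == 0)
c(dropped = n_dropped, zeros = n_zero)
```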

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). The 13th variable, X, is just a row index.

(worst) ----> (best)

quality: 3, 4, 5, 6, 7, 8

Other observations:

  • The median quality is 6.
  • Most wines have less than 0.1 g/dm^3 of chlorides.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to find out which features influence the quality of wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The variables related to acidity (fixed, volatile, citric, pH) may influence the taste of wines. Residual sugar, which indicates the sweetness of the wine, may also play an important role.

Did you create any new variables from existing variables in the dataset?

I created a rating variable to help with the later visualizations. Also, since fixed.acidity, volatile.acidity and citric.acid all describe the acidity of the wine, I created a new variable called FVC.acidity, which is the sum of these three values.
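As a sketch, the combined acidity variable can be built like this. The four rows are the first four observations from the str() output above, and the column name FVC.acidity follows the correlation output later in the report:

```r
# First four observations, copied from the str() output above.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Sum the three acidity measures row by row.
df$FVC.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid

df$FVC.acidity
```

Because fixed acidity is roughly ten times larger than the other two components, the sum is dominated by it.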

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. I log-transformed the right skewed fixed.acidity, volatile.acidity, citric.acid and alcohol.

Bivariate Plots Section

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                    1          Pearson     Pearson
## volatile.acidity           -0.2561                1     Pearson
## citric.acid                 0.6717          -0.5525           1
## residual.sugar              0.1148         0.001918      0.1436
## chlorides                  0.09371           0.0613      0.2038
## free.sulfur.dioxide        -0.1538          -0.0105    -0.06098
## total.sulfur.dioxide       -0.1132          0.07647     0.03553
## density                      0.668          0.02203      0.3649
## pH                          -0.683           0.2349     -0.5419
## sulphates                    0.183           -0.261      0.3128
## alcohol                   -0.06167          -0.2023      0.1099
## quality                     0.1241          -0.3906      0.2264
## FVC.acidity                 0.9964          -0.2044      0.6904
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity               Pearson   Pearson             Pearson
## volatile.acidity            Pearson   Pearson             Pearson
## citric.acid                 Pearson   Pearson             Pearson
## residual.sugar                    1   Pearson             Pearson
## chlorides                   0.05561         1             Pearson
## free.sulfur.dioxide           0.187  0.005562                   1
## total.sulfur.dioxide          0.203    0.0474              0.6677
## density                      0.3553    0.2006            -0.02195
## pH                         -0.08565    -0.265             0.07038
## sulphates                  0.005527    0.3713             0.05166
## alcohol                     0.04208   -0.2211            -0.06941
## quality                     0.01373   -0.1289            -0.05066
## FVC.acidity                  0.1245    0.1167             -0.1536
##                      total.sulfur.dioxide density       pH sulphates
## fixed.acidity                     Pearson Pearson  Pearson   Pearson
## volatile.acidity                  Pearson Pearson  Pearson   Pearson
## citric.acid                       Pearson Pearson  Pearson   Pearson
## residual.sugar                    Pearson Pearson  Pearson   Pearson
## chlorides                         Pearson Pearson  Pearson   Pearson
## free.sulfur.dioxide               Pearson Pearson  Pearson   Pearson
## total.sulfur.dioxide                    1 Pearson  Pearson   Pearson
## density                           0.07127       1  Pearson   Pearson
## pH                               -0.06649 -0.3417        1   Pearson
## sulphates                         0.04295  0.1485  -0.1966         1
## alcohol                           -0.2057 -0.4962   0.2056   0.09359
## quality                           -0.1851 -0.1749 -0.05773    0.2514
## FVC.acidity                      -0.09628  0.6756  -0.6835    0.1816
##                       alcohol quality FVC.acidity
## fixed.acidity         Pearson Pearson     Pearson
## volatile.acidity      Pearson Pearson     Pearson
## citric.acid           Pearson Pearson     Pearson
## residual.sugar        Pearson Pearson     Pearson
## chlorides             Pearson Pearson     Pearson
## free.sulfur.dioxide   Pearson Pearson     Pearson
## total.sulfur.dioxide  Pearson Pearson     Pearson
## density               Pearson Pearson     Pearson
## pH                    Pearson Pearson     Pearson
## sulphates             Pearson Pearson     Pearson
## alcohol                     1 Pearson     Pearson
## quality                0.4762       1     Pearson
## FVC.acidity          -0.06667  0.1038           1
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
## FVC.acidity             0.99638446     -0.204350914  0.69043814
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
## FVC.acidity             0.124487746  0.116674670        -0.153614137
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
## FVC.acidity                   -0.09627567  0.67559618 -0.68348382
##                         sulphates     alcohol     quality FVC.acidity
## fixed.acidity         0.183005664 -0.06166827  0.12405165  0.99638446
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778 -0.20435091
## citric.acid           0.312770044  0.10990325  0.22637251  0.69043814
## residual.sugar        0.005527121  0.04207544  0.01373164  0.12448775
## chlorides             0.371260481 -0.22114054 -0.12890656  0.11667467
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606 -0.15361414
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029 -0.09627567
## density               0.148506412 -0.49617977 -0.17491923  0.67559618
## pH                   -0.196647602  0.20563251 -0.05773139 -0.68348382
## sulphates             1.000000000  0.09359475  0.25139708  0.18160349
## alcohol               0.093594750  1.00000000  0.47616632 -0.06666786
## quality               0.251397079  0.47616632  1.00000000  0.10375373
## FVC.acidity           0.181603491 -0.06666786  0.10375373  1.00000000

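Exploring these plots, we can see that a ‘good’ wine generally shows these trends:

  • higher fixed acidity (tartaric acid) and citric acid, and lower volatile acidity;
  • lower pH;
  • higher sulphates;
  • higher alcohol;
  • to a lesser extent, lower chlorides and lower density.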

pH VS. Three Acidity

Let’s examine how each acid concentration affects pH.

## Correlation:  -0.7063602

## Correlation:  0.2231154

## Correlation:  -0.7044435

Because pH measures acidity on a log scale, it is no surprise to find a strong correlation between pH and the log of the acid concentration (FVC.acidity). We can investigate this further with a linear model.

## 
## Call:
## lm(formula = pH ~ log10(FVC.acidity), data = subset(df))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46869 -0.06359 -0.00024  0.06385  0.48539 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.55344    0.03144  144.82   <2e-16 ***
## log10(FVC.acidity) -1.30534    0.03291  -39.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1096 on 1597 degrees of freedom
## Multiple R-squared:  0.4962, Adjusted R-squared:  0.4959 
## F-statistic:  1573 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the R^2 value, log10(FVC.acidity) explains only about half of the variance in pH (R^2 = 0.496). According to the plot above, the fit is noticeably worse for poor and excellent wines. This suggests that other components affect the pH as well.
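For reference, here is a sketch of the same model formula fitted to the first ten observations from the str() output above. With only ten rows the coefficients and R^2 will differ from the full-data summary, so this shows the mechanics rather than the result:

```r
# First ten observations, copied from the str() output above.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)
df$FVC.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid

# Regress pH on the log of the combined acidity.
fit <- lm(pH ~ log10(FVC.acidity), data = df)

# Slope and share of variance explained.
slope <- coef(fit)[["log10(FVC.acidity)"]]
r2    <- summary(fit)$r.squared
c(slope = slope, r.squared = r2)
```

For a model with a single predictor, R^2 is the square of the correlation, which is why the full-data R^2 of 0.4962 matches the correlation of -0.7044 reported above.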

Sulphates VS. Quality

The boxplot above shows the relationship between sulphates and wine quality. Better wines seem to have a higher concentration of sulphates, though there are many outliers among the average wines.

Alcohol VS. Quality

The relationship here is clear. As alcohol increases, the wine tends to have higher quality, especially among the high-end wines.
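One way to put numbers on such a comparison is the mean alcohol content per quality score. The sketch below uses only the first ten observations from the str() output above, which are far too few to show the trend, so it illustrates the mechanics only:

```r
# First ten observations, copied from the str() output above.
df <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Mean alcohol for each quality score; names of the result are the scores.
mean_alcohol <- tapply(df$alcohol, df$quality, mean)
mean_alcohol
```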

Density VS. Alcohol

## Correlation:  -0.4909483

The correlation between density and alcohol makes sense, since alcohol is less dense than water. The major component of wine is water, so more alcohol means a lower density, and the two features should be negatively correlated. That’s exactly the case, as shown in the plot.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

First I looked at the correlations among the features, and then I explored the relationships between particular pairs.

  • pH VS. three acidity (fixed, volatile and citric) and their combined feature, FVC.acidity: As expected, higher acidity goes with lower pH. Surprisingly, though, volatile acidity correlates positively with pH (correlation 0.223). Fixed acidity appears to play the major role in determining the pH of a wine.
  • Sulphates VS. Quality: Better wines seem to have a higher concentration of sulphates.
  • Alcohol VS. Quality: Better wines tend to have a higher concentration of alcohol.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • pH VS. volatile acidity: It’s a surprise to me that pH and volatile acidity have a positive correlation.
  • Density VS. Alcohol: Wines that have more alcohol tend to have lower density, and that makes sense.

What was the strongest relationship you found?

  • pH VS. fixed acidity The pH of a wine is negatively and strongly correlated with fixed acidity and the correlation is -0.7063602.

Multivariate Plots Section

Density VS. alcohol, colored by quality

The plot above indicates that the quality of a wine tends to have little relationship with the density. However, good wines seem to have high concentration of alcohol, as discovered in the last plot section.

Fixed.acidity VS. alcohol, colored by quality

The plot above indicates that wines with both high alcohol and high fixed acidity tend to be better.

Volatile.acidity VS. alcohol, colored by quality

Clearly, lower volatile acidity combined with high alcohol goes with better wines. According to wineQualityInfo.txt downloaded from Udacity, volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. So it makes sense that good wines tend to have lower volatile acidity.

pH VS. alcohol, colored by quality

The plot above indicates that good wines tend to combine high alcohol with low pH.

Sulphates VS. alcohol, colored by quality

It seems that for wines with high alcohol, higher sulphates tend to produce better wines. That’s interesting!

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Good wines seem to have a combination of high concentration of alcohol and fixed acidity and lower pH.

Were there any interesting or surprising interactions between features?

Among wines with high alcohol, those with higher sulphates tend to be better.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The majority of wines are of moderate quality (scores 5 or 6). This is plausible, since most consumers weigh price against quality and buy mid-range wines (high quality usually means a high price).

Plot Two

Description Two

The chart above suggests that alcohol is strongly associated with wine quality. The effect is most visible for good wines: wines with the highest quality score (8) have about 12% alcohol by volume on average.

Plot Three

Description Three

The plot divides the points into three parts based on the quality rating. Holding alcohol concentration constant, wines with higher sulphates almost always have better quality than wines with lower sulphates.


Reflection

The wine quality data set contains information on 1599 wines across 13 variables. I started by understanding the individual variables in the data set based on the introduction in wineQualityInfo.txt, and then I explored interesting questions and leads as I continued to make observations on plots. I mainly focused on the features that may have an influence on the quality of wines.

During the investigation of these features, we found that a ‘good’ wine generally shows these trends: (1) higher fixed acidity (tartaric acid) and citric acid, and lower volatile acidity (quite surprising); (2) lower pH; (3) higher sulphates; (4) higher alcohol; (5) to a lesser extent, lower chlorides and lower density. I was surprised that lower volatile acidity goes with better wines, but it made sense once I found that volatile acidity is the amount of acetic acid in wine, and that too high a level of acetic acid can lead to an unpleasant, vinegar taste. The second surprise was that the correlation between volatile acidity (acetic acid) and pH was positive. That’s odd. A possible explanation is that pH is not determined by volatile acidity alone; other components such as fixed acidity also play a vital role.

I also ran into some problems. When I tried to explore the correlation between pH and citric acid, it seemed reasonable to compare the two on the same scale, so I took the log of citric acid. This caused a problem: some citric acid values are zero, and log10(0) is not a finite number. These values were lost, and I couldn’t compute the correlation between the two.
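One possible workaround, which is not what the original analysis did, is to compute the correlation only over wines with a non-zero citric acid value. The sketch below uses the first ten observations from the str() output above. Note that dropping the zeros changes the sample, so the result describes only wines that contain citric acid:

```r
# First ten observations, copied from the str() output above.
df <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH          = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Keep only rows where the log is finite, i.e. citric acid is positive.
nonzero <- subset(df, citric.acid > 0)

# Pearson correlation between log10(citric acid) and pH on the remaining rows.
r <- cor(log10(nonzero$citric.acid), nonzero$pH)
r
```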

In the next stage of analysing the data set, I would like to improve my skills in choosing appropriate plots. It’s also important to think about the question (the interest of the exploration) from different angles and to draw more precise conclusions.